This project uses a wine quality dataset to study how different physicochemical parameters affect the quality of red and white wine. The objective is to determine the top three predictors of quality for each wine type. Because this is a predictive question, the study uses a machine learning method to predict the wine quality (the target variable) from the wine features. A tree classifier was used for this purpose, and the top three predictors were determined from this classifier.
The wine quality data was obtained from the UCI Machine Learning Repository [1]; the original data was prepared by P. Cortez [2]. The data is divided into two datasets, one for the red and one for the white variant of the Portuguese "Vinho Verde" wine. Both datasets contain the most common physicochemical (feature) and sensory (target) variables, 12 variables in total, with 1599 red and 4898 white examples [3]. The features are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol. The target variable is the wine quality, defined as a numerical score between 0 (the worst wine) and 10 (the best wine). However, these datasets contain no observation with a quality lower than 3 or higher than 9. Table 1 summarizes the feature statistics for each dataset.
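As a rough illustration of how a tree-style criterion can rank features, the sketch below scores each feature by the largest Gini impurity decrease that a single threshold split can achieve, which is the criterion a one-level decision tree would apply at its root. This is not the classifier used in this study. The function names (`gini`, `best_split_gain`, `rank_features`) and the toy rows are hypothetical and serve only as an example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split_gain(values, labels):
    """Largest Gini impurity decrease from one threshold split on one feature."""
    n = len(labels)
    parent = gini(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def rank_features(rows, target, top=3):
    """Return the `top` feature names ordered by their best single-split gain."""
    features = [name for name in rows[0] if name != target]
    labels = [row[target] for row in rows]
    gains = {f: best_split_gain([row[f] for row in rows], labels) for f in features}
    return sorted(gains, key=gains.get, reverse=True)[:top]


# Hypothetical toy rows, not taken from the wine datasets.
toy_rows = [
    {"alcohol": 9.0, "pH": 3.30, "sulphates": 0.50, "quality": "normal"},
    {"alcohol": 9.4, "pH": 3.10, "sulphates": 0.60, "quality": "normal"},
    {"alcohol": 9.8, "pH": 3.40, "sulphates": 0.70, "quality": "normal"},
    {"alcohol": 11.5, "pH": 3.20, "sulphates": 0.65, "quality": "good"},
    {"alcohol": 12.0, "pH": 3.35, "sulphates": 0.80, "quality": "good"},
    {"alcohol": 12.5, "pH": 3.15, "sulphates": 0.75, "quality": "good"},
]

top_features = rank_features(toy_rows, target="quality")
```

A full tree classifier aggregates such impurity decreases over all of its splits, so its ranking can differ from this single-split score.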
Table 1. The summary of feature statistics for each dataset
| Feature | Red wine (min) | Red wine (mean) | Red wine (max) | White wine (min) | White wine (mean) | White wine (max) |
|---|---|---|---|---|---|---|
| Fixed acidity | 4.60 | 8.32 | 15.90 | 3.80 | 6.85 | 14.20 |
| Volatile acidity | 0.12 | 0.53 | 1.58 | 0.08 | 0.28 | 1.10 |
| Citric acid | 0.00 | 0.27 | 1.00 | 0.00 | 0.33 | 1.66 |
| Residual sugar | 0.90 | 2.54 | 15.50 | 0.60 | 6.39 | 65.80 |
| Chlorides | 0.01 | 0.09 | 0.61 | 0.01 | 0.05 | 0.35 |
| Free sulfur dioxide | 1.00 | 15.87 | 72.00 | 2.00 | 35.31 | 289.00 |
| Total sulfur dioxide | 6.00 | 46.47 | 289.00 | 9.0 | 138.4 | 440.0 |
| Density | 0.990 | 0.997 | 1.004 | 0.987 | 0.994 | 1.039 |
| pH | 2.74 | 3.31 | 4.01 | 2.72 | 3.19 | 3.82 |
| Sulphates | 0.33 | 0.66 | 2.00 | 0.22 | 0.49 | 1.08 |
| Alcohol | 8.40 | 10.42 | 14.90 | 8.00 | 10.51 | 14.20 |
| Quality | 3.00 | 5.64 | 8.00 | 3.00 | 5.88 | 9.00 |
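
A summary like Table 1 can be computed with a few lines of code. The sketch below is a minimal example that assumes each observation is a dictionary of numeric values; the helper name `summarize` and the three toy rows are hypothetical and do not come from the datasets.

```python
from statistics import mean


def summarize(rows, features):
    """Return {feature: (min, mean, max)}, each rounded to two decimals."""
    summary = {}
    for feature in features:
        column = [row[feature] for row in rows]
        summary[feature] = (
            round(min(column), 2),
            round(mean(column), 2),
            round(max(column), 2),
        )
    return summary


# Hypothetical toy rows used only to exercise the helper.
toy_rows = [
    {"alcohol": 9.4, "pH": 3.51},
    {"alcohol": 9.8, "pH": 3.20},
    {"alcohol": 10.5, "pH": 3.26},
]

summary = summarize(toy_rows, ["alcohol", "pH"])
```

Running the same helper once per dataset gives the red and white columns of the table.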
The datasets were checked to confirm that they contain no missing values. The classes in both the red and white wine datasets are not balanced, because there are many more normal wines than good or bad ones. Figure 1 shows a bar plot of the number of observations for each wine quality in the red wine dataset. Figure 2 shows the same bar plot for the white wine dataset (the number of observations for each quality is given on top of the bars).
Figure 1. The bar plot of the number of observations for different wine qualities in the red wine dataset
Figure 2. The bar plot of the number of observations for different wine qualities in the white wine dataset
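
The missing-value check and the per-quality counts behind these bar plots can be sketched as follows. This is a minimal example that assumes a missing entry appears as `None` or an empty string; the helper names `count_missing` and `class_counts` and the toy rows are hypothetical.

```python
from collections import Counter


def count_missing(rows):
    """Count entries that are None or an empty string across all rows."""
    return sum(1 for row in rows for value in row.values() if value is None or value == "")


def class_counts(rows, target="quality"):
    """Number of observations per quality score, ordered by score."""
    return dict(sorted(Counter(row[target] for row in rows).items()))


# Hypothetical toy rows; the real datasets have 1599 and 4898 observations.
toy_rows = [{"alcohol": 10.0, "quality": q} for q in [5, 5, 6, 5, 7, 6, 3, 5]]

missing = count_missing(toy_rows)
counts = class_counts(toy_rows)
```

A count that is much larger for the middle scores than for the extremes is the imbalance that the bar plots display.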
To mitigate this class imbalance, some values of the quality variable were combined and the resulting values were converted into categorical values. The goal was to merge the quality levels that have very few observations. Two different grouping patterns were tried for this purpose. Figures 3 and 4 show the result of this transformation for the first pattern.
Figure 3. The bar plot of the number of observations for different wine qualities in the cleaned red wine dataset (pattern 1)
Figure 4. The bar plot of the number of observations for different wine qualities in the cleaned white wine dataset (pattern 1)
Figures 5 and 6 show the result of this transformation for the second pattern.
Figure 5. The bar plot of the number of observations for different wine qualities in the cleaned red wine dataset (pattern 2)
Figure 6. The bar plot of the number of observations for different wine qualities in the cleaned white wine dataset (pattern 2)
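
Merging sparse quality scores into categories can be sketched as a simple mapping. The cut points below (`low=4`, `high=7`) and the labels are assumptions made for illustration only; the two patterns actually used in this study are defined by the figures above and are not reproduced here.

```python
def bin_quality(score, low=4, high=7):
    """Map a numeric quality score to a category using hypothetical cut points."""
    if score <= low:
        return "bad"
    if score >= high:
        return "good"
    return "normal"


# Scores 3 to 9 cover the range observed in the datasets.
binned = [bin_quality(score) for score in range(3, 10)]
```

Changing `low` and `high` produces a different grouping, which is one way to compare alternative patterns.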
The effect of each wine feature on quality was studied through exploratory data visualization of the cleaned datasets. Figure 7 shows the violin and jitter plots, with error bars, of each feature versus red wine quality in the cleaned data (pattern 1). Figure 8 shows the corresponding plots for white wine in the cleaned data (pattern 1).
Figure 7. The violin and jitter plot of each feature versus red wine quality (with error bars)
Figure 8. The violin and jitter plot of each feature versus white wine quality (with error bars)
These figures suggest that alcohol is probably the most important predictor for both red and white wines, since the difference in alcohol between quality classes is larger than its standard error in both cases. Figures 9 and 10 show the violin and jitter plots of each feature versus the wine quality in the cleaned data (pattern 2) for red and white wines, respectively.
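
The comparison between a difference in group means and the standard error, used above to judge the alcohol feature, can be sketched as follows. The helper names `sem` and `mean_gap_vs_sem` and the alcohol values are hypothetical; the sketch assumes the standard error of the mean is the sample standard deviation divided by the square root of the group size.

```python
from math import sqrt
from statistics import mean, stdev


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / sqrt(len(values))


def mean_gap_vs_sem(group_a, group_b):
    """Return the absolute difference in means and the larger of the two SEMs."""
    gap = abs(mean(group_a) - mean(group_b))
    return gap, max(sem(group_a), sem(group_b))


# Hypothetical alcohol values for two quality classes.
normal_alcohol = [9.4, 9.8, 10.0, 9.6]
good_alcohol = [11.2, 11.8, 12.0, 11.4]

gap, larger_sem = mean_gap_vs_sem(normal_alcohol, good_alcohol)
```

A gap that is several times larger than the standard error suggests the feature separates the classes; a gap of similar size to the standard error does not.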